Data Analyzer using Unity Catalog
In the data analyzer stage, you analyze the complete dataset based on selected constraints. To do this, add the data analyzer node to the data quality stage and then create a data analyzer job.
Prerequisites
You must complete the following prerequisites before creating a data analyzer job:
- The data quality nodes have specific requirements for the Databricks Runtime version and the access mode of the cluster. The following requirements apply when Unity Catalog-enabled Databricks is used as a data analyzer node in the data pipeline:
  - Data Quality Node: Data Analyzer
  - Databricks Cluster Runtime Version: 12.2 LTS
  - Access Mode: Dedicated
- Access to a Databricks Unity Catalog node that will be used as a data lake in the data ingestion pipeline.
Creating a data analyzer job
- On the home page of Data Pipeline Studio, click the data analyzer node and click Create Job.
- On the Unity Catalog Analyzer Job tab, click Create Job.
- Complete the following steps to create the job:
Job Name
Provide job details for the data analyzer job:
- Template - Based on the source and destination that you choose in the data pipeline, the template is automatically selected.
- Job Name - Provide a name for the data analyzer job.
- Node Rerun Attempts - Specify the number of times a rerun is attempted on this node of the pipeline in case of failure. The default is set at the pipeline level. You can change the rerun attempts by selecting 1, 2, or 3.
Source
- Source - This is automatically selected depending on the data lake node configured and added in the pipeline.
- Datastore - The configured datastore that you added in the data pipeline is displayed.
- Catalog Name - The catalog that is associated with the configured datastore is displayed.
- Schema Name - The schema associated with the catalog is displayed. The schema is selected based on the catalog, but you can select a different schema.
- Source Table - Select a table from the selected datastore.
- Data Processing Type - Select the type of processing that must be done for the data. Choose from the following options:
Delta
In this type of processing, incremental data is processed. For the first job run, the complete dataset is considered; for subsequent job runs, only the delta data is considered.
- Based On - The delta processing is done based on one of the following options:
  - Table Versions (Change Data Feed) - Unity Catalog stores data in Delta format and creates versions of tables when the data changes; this feature is called Change Data Feed. Select this option if you are using the Delta format. (A minimal sketch of reading changes this way is shown after the Full option below.)
  - Table Columns - Unity Catalog also supports formats such as CSV, JSON, and Parquet, where data is stored in tables. Select this option if you are not using the Delta format.
- Table Versions - If you select this option, provide Unique Identifier Columns by selecting one or more columns of the Delta table that uniquely identify each record and can be used as a reference to retrieve the latest version of the data.
- Table Columns - If you select this option, provide Unique Identifier Columns by selecting a column with unique records. This column is used to create versions if multiple versions of the same record exist.
Full
In this type of processing, the complete dataset is considered for processing in each job run.
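The following is a minimal, non-authoritative sketch of what the Table Versions (Change Data Feed) option corresponds to in PySpark: it reads incremental changes from a Delta table and keeps the latest version of each record using the unique identifier columns. The table name main.sales.orders, the order_id identifier, and the starting version are placeholders, and the snippet assumes a Databricks notebook where spark is already defined.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Placeholder source table and unique identifier columns.
SOURCE_TABLE = "main.sales.orders"
UNIQUE_ID_COLS = ["order_id"]

# Read incremental changes recorded by Change Data Feed (requires
# delta.enableChangeDataFeed = true on the source table).
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)  # in practice, resume after the last processed version
    .table(SOURCE_TABLE)
)

# Keep only the latest change per record, using the unique identifier columns
# as the reference for retrieving the most recent version.
latest = (
    changes.filter(F.col("_change_type").isin("insert", "update_postimage"))
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy(*UNIQUE_ID_COLS).orderBy(F.col("_commit_version").desc())
        ),
    )
    .filter("rn = 1")
    .drop("rn")
)

latest.show()
```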
- Constraint - Select the constraint that you want to run on the data in the analyzer job. (For a sense of what a constraint evaluates on a column, see the sketch after these steps.)
- Column - Select a column on which you want to run the constraint.
- Click Add.
- After you have added the required constraints, click Add to create a summary of the selected data processing options.
- You can perform the following actions:
  - Click the constraints column to open the side drawer and review the added constraints.
  - Click the ellipsis (...) to edit or delete the constraints.
- Click Next.
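As a rough, non-authoritative illustration of what a constraint evaluates on a column, the following PySpark sketch computes two common data-quality metrics, completeness and distinctness. The constraints available in the platform and the metrics it actually computes may differ; the table and column names here are placeholders, and spark is assumed to be available in a Databricks notebook.

```python
from pyspark.sql import functions as F

# Placeholder source table and column selected for the constraint.
df = spark.table("main.sales.orders")
column = "customer_id"

total_rows = df.count()

metrics = df.agg(
    # Completeness: fraction of non-null values in the column.
    (F.count(F.col(column)) / F.lit(total_rows)).alias("completeness"),
    # Distinctness: fraction of distinct values relative to the total row count.
    (F.countDistinct(F.col(column)) / F.lit(total_rows)).alias("distinctness"),
)

metrics.show()
```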
Target
- Target - The data lake that was configured for the target is auto-populated.
- Datastore - The datastore that you configured for the Databricks Unity Catalog is auto-populated.
- Catalog Name - The catalog name that is associated with the Unity Catalog instance is auto-populated.
- Schema - The schema that is associated with the catalog is auto-populated. If required, you can change the schema at this stage.
- Map source data to target tables - Map the source table to a table in the target. You can either map an existing table or create a new table and map it. (A sketch of addressing a target table in Unity Catalog is shown after these steps.)
Do one of the following:
- Source - Select a table from the dropdown list.
- Target - Select a table from the dropdown list, or type a name for a new table and click Click to create "table name".
- Click Map Table.
To delete a mapping, click the ellipsis (...) and then click Delete.
Click Next.
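Tables in Unity Catalog are addressed by the three-level catalog.schema.table name, which is what the target mapping above resolves to. The following is a minimal sketch assuming hypothetical names and a result layout invented purely for illustration; when you let the platform create the table from the mapping dialog, it defines the actual columns itself.

```python
# Placeholder names; the platform resolves the mapping to catalog.schema.table.
catalog = "main"
schema = "data_quality"
table = "orders_analyzer_results"

# Pre-creating a target table with an illustrative (not platform-defined) layout.
spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} (
        constraint_name STRING,
        column_name     STRING,
        metric_value    DOUBLE,
        run_timestamp   TIMESTAMP
    )
    """
)
```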
Cluster Configuration
You can select an all-purpose cluster or a job cluster to run the configured job.
If your Databricks cluster is not created through the Calibo Accelerate platform and you want to update custom environment variables, refer to the following:
All Purpose Clusters
Cluster - Select the all-purpose cluster that you want to use for the data quality job from the dropdown list.
Job Cluster
- Choose Cluster - Provide a name for the job cluster that you want to create.
- Job Configuration Name - Provide a name for the job cluster configuration.
- Databricks Runtime Version - Select the appropriate Databricks Runtime version.
- Worker Type - Select the worker type for the job cluster.
- Workers - Enter the number of workers to be used for running the job in the job cluster. You can either have a fixed number of workers or choose autoscaling.
- Enable Autoscaling - Autoscaling scales the number of workers up or down within the range that you specify. This helps reallocate workers to a job during its compute-intensive phase; once the compute requirement reduces, the excess workers are removed, which helps control your resource costs.

Cloud Infrastructure Details

- First on Demand - Provide the number of cluster nodes that are marked as first_on_demand. The first_on_demand nodes of the cluster are placed on on-demand instances.
- Availability - Choose the type of EC2 instances on which to launch your Apache Spark clusters, from the following options:
  - Spot
  - On-demand
  - Spot with fallback
- Zone - Identifier of the availability zone or data center in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment.
- Instance Profile ARN - Provide an instance profile ARN that can access the target Amazon S3 bucket.
- EBS Volume Type - The type of EBS volumes that are launched with this cluster.
- EBS Volume Count - The number of volumes launched for each instance of the cluster.
- EBS Volume Size - The size of the EBS volumes to be used for the cluster.

Additional Details

- Spark Config - To fine-tune Spark jobs, provide custom Spark configuration properties as key-value pairs.
- Environment Variables - Configure custom environment variables that you can use in init scripts.
- Logging Path (DBFS Only) - Provide the logging path to deliver the logs for the Spark jobs.
- Init Scripts - Provide the init (initialization) scripts that run during the startup of each cluster.

A sketch of how these settings map onto a Databricks cluster specification follows this list.
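As a non-authoritative illustration, the dictionary below shows roughly how the values collected above look when expressed in Databricks Clusters API terms. Every concrete value is a placeholder; the Calibo Accelerate platform gathers the equivalent settings through its UI, so treat this only as a mapping aid.

```python
# Placeholder job cluster specification in Databricks Clusters API terms.
job_cluster_spec = {
    "spark_version": "12.2.x-scala2.12",                # Databricks Runtime Version
    "node_type_id": "i3.xlarge",                        # Worker Type
    "autoscale": {"min_workers": 2, "max_workers": 8},  # or "num_workers": 4 for a fixed size
    "aws_attributes": {
        "first_on_demand": 1,                  # nodes placed on on-demand instances
        "availability": "SPOT_WITH_FALLBACK",  # SPOT, ON_DEMAND, or SPOT_WITH_FALLBACK
        "zone_id": "us-east-1a",               # must match the Databricks deployment region
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/s3-access",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,                # size in GB
    },
    "spark_conf": {                            # Spark Config key-value pairs
        "spark.sql.shuffle.partitions": "200",
    },
    "spark_env_vars": {                        # custom environment variables for init scripts
        "ENVIRONMENT": "dev",
    },
    "init_scripts": [{"dbfs": {"destination": "dbfs:/init/install_libs.sh"}}],
    "cluster_log_conf": {                      # Logging Path (DBFS only)
        "dbfs": {"destination": "dbfs:/cluster-logs/data-analyzer"},
    },
}
```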
Notifications

You can configure the SQS and SNS services to send notifications related to the node in this job. This provides information about various events related to the node without actually connecting to the Calibo Accelerate platform.
- SQS and SNS Configurations - Select an SQS or SNS configuration that is integrated with the Calibo Accelerate platform.
- Events - Enable the events for which you want to receive notifications:
  - Select All
  - Node Execution Failed
  - Node Execution Succeeded
  - Node Execution Running
  - Node Execution Rejected
- Event Details - From the dropdown list, select the details of the events that you want to include in the notifications.
- Additional Parameters - Provide any additional parameters that are to be added to the SQS and SNS notifications. A sample JSON is provided; you can use it to write logic for processing the events, as in the sketch below.
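The following is a rough sketch of consuming such notifications from an SQS queue with boto3. The queue URL and every field name in the event payload are hypothetical placeholders; the real keys come from the sample JSON shown in the platform.

```python
import json

import boto3  # assumes AWS credentials are configured in the environment

# Hypothetical queue URL for the SQS configuration attached to this node.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/data-analyzer-events"

sqs = boto3.client("sqs")


def poll_events() -> None:
    """Poll the queue once and react to node events (field names are placeholders)."""
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=10,  # long polling
    )
    for message in response.get("Messages", []):
        event = json.loads(message["Body"])
        # The real keys depend on the sample JSON provided by the platform.
        if event.get("eventType") == "NODE_EXECUTION_FAILED":
            print(f"Data analyzer node failed: {event}")
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])


if __name__ == "__main__":
    poll_events()
```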
Running the data analyzer job

You can run the data analyzer job in multiple ways:
- Publish the pipeline and click Run Pipeline.
- Click the data analyzer node and click Start to initiate the Unity Catalog Data Analyzer Job run.
Viewing the results of the data analyzer job
- After the job run is successful, click the Unity Catalog Analyzer Result tab.
- Click View Analyzer Results.
- On the Output of Analyzer Runner screen, the SQL warehouse associated with this Unity Catalog instance is preselected. Select a job run from the Run Details dropdown list.
- You can view the results of the selected data analyzer job run. Click the download icon to download the results in a CSV file.
What's next? Data Issue Resolver using Unity Catalog